[FEA] Generate updated supported CSV files from plugin repo #847
Conversation
Signed-off-by: cindyyuanjiang <[email protected]>
Thanks @cindyyuanjiang !
I have a couple of questions:
- Do the dependencies for this script (pandas) have to be specified?
- What would be the function of the Jenkins job, i.e., will this Python script update the supported*.csv files and create a new PR?
NA = 2
PS = 3
S = 4
CO = 5
CO is "configured off", correct? So this takes precedence over S if there is an Exec with CO in one version and S in another Spark version?
@nartal1 Thanks! I need to confirm what CO means. The current script assumes CO takes the highest precedence. I can change this once I get an answer.
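The precedence-based union being discussed could be sketched as follows; the numeric values mirror the snippet under review, and the helper name is illustrative, not part of the actual script:

```python
# Precedence values mirroring the snippet under review; the higher value wins
# when the same operator has different support levels across Spark versions.
PRECEDENCE = {"NA": 2, "PS": 3, "S": 4, "CO": 5}

def unify_levels(levels):
    """Pick the support level with the highest precedence (illustrative helper)."""
    return max(levels, key=lambda level: PRECEDENCE[level])
```

Under this assumption, an Exec marked CO in one version and S in another resolves to CO, which is exactly the behavior questioned above.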
Thanks @nartal1!
I added some comments in the script for dependencies. My original intention was to include this script for review/future-development reference. Do we need to explicitly specify them in a file like
The jenkins job would likely have the following process:
Signed-off-by: cindyyuanjiang <[email protected]>
Thanks @cindyyuanjiang for working on this new feature.
I haven't gotten the chance to review the PR yet, but I have one suggestion: it would be nice if the new Python script accepted an input to override the data pulled from the plugin. For example, we could have a config file in JSON format to describe the explicit overrides we need to do on the tools side.
Examples:
- Previously, the tools resource files manually added promote_precision as the SQL func name for PromotePrecision; that SQL func name did not exist in the plugin files. It would be nice to have the configuration file do that for us.
- In some cases, user-facing teams ask to explicitly override a certain behavior for certain customers. These can also be part of the configuration files.
- Down the road (future improvement), the config file can be an optional input. This allows changing the behavior without the need to build a custom wheel file or make a new release.
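A minimal sketch of what such a JSON override config could look like and how the script might apply it; the file shape, keys, and column names here are hypothetical, not the format the PR ends up committing:

```python
import json

# Hypothetical JSON override shape: operator name -> columns to force.
# The PromotePrecision example comes from the discussion above.
OVERRIDE_JSON = """
{
  "PromotePrecision": {"SQL Func": "promote_precision"}
}
"""

def apply_overrides(rows, overrides):
    """Patch generated CSV rows (dicts keyed by operator name) in place."""
    for row in rows:
        for col, value in overrides.get(row["Expression"], {}).items():
            row[col] = value
    return rows

rows = [{"Expression": "PromotePrecision", "SQL Func": ""}]
apply_overrides(rows, json.loads(OVERRIDE_JSON))
```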
CCing @viadea , @kuhushukla to get their thoughts because this feature would be used frequently by Hao and Kuhu.
CC @mattahrens
Thanks @cindyyuanjiang. A few minor questions.
def unify_all_files(root_dir, file_name, key_names):
    final_df = pd.DataFrame()
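For context, the union step shown above could be sketched roughly as follows with pandas; the directory layout and the de-duplication-by-key behavior are assumptions for illustration, not the script's actual implementation:

```python
import glob
import os

import pandas as pd

def unify_all_files(root_dir, file_name, key_names):
    """Union one CSV across all Spark-version subdirectories of root_dir."""
    paths = sorted(glob.glob(os.path.join(root_dir, "*", file_name)))
    frames = [pd.read_csv(p) for p in paths]
    final_df = pd.concat(frames, ignore_index=True)
    # Drop exact duplicate rows; conflicting support levels would still
    # need the precedence rule discussed elsewhere in this thread.
    return final_df.drop_duplicates(subset=key_names, ignore_index=True)
```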
Docstrings should be inside the definition for all functions
Thanks @parthosa! I will update this.
Signed-off-by: cindyyuanjiang <[email protected]>
Thanks @amahussein! I updated the Python script to accept an optional config file which overrides the final output, and also included the config file we need to use for tools.
Signed-off-by: cindyyuanjiang <[email protected]>
What does CO mean here?
CO in this context means "Configured Off", specifically for read formats. If the read format is off by default in the plugin, it's marked as CO. Here Avro and JSON are off by default, so we end up adding CO for these read formats.
The JSON and Avro formats are disabled by default.
Signed-off-by: cindyyuanjiang <[email protected]>
Thanks @cindyyuanjiang
- Can we have the generated files sorted alphabetically, similar to what the plugin sources have?
- I have some concerns about committing the changes in the CSV files:
  - It is hard to tell if the new change overrides any manual change we made in those files. For example, if I hadn't called out promote_precision, we would not see it. Is there another diff that can show us the change done to each operator? A possible workaround is to maintain the order of the columns so that the diff won't be bogus.
  - There are new rows introduced here, but we have not done any audit to assess the handling of the new operators, especially for Execs and some of the UDF expressions added in exprs.csv.
@cindyyuanjiang did you have any thoughts on how to address those issues?
CC @mattahrens
Thanks @amahussein!
I looked into the generated files in the plugin repo. They are not sorted alphabetically either. If we need to follow the order strictly, I will spend more time on it. For now the order is generally the same, with a few exceptions.
The manual changes done on the tools side are currently not tracked (we can use the config file to track them going forward). The current bogus diff is due to two new columns, DAYTIME and YEARMONTH, in supportedExecs.csv and in supportedExprs.csv.
The diff may still look bogus, but if we look closely, we can see the difference is mainly due to:
- longer SQL func values (date_add; dateadd vs date_add)
For the new Execs and Expressions, I will file follow-up PRs to add the support.
Thanks @cindyyuanjiang
The point I was making in my previous comment is that, regardless of follow-ups, we cannot have the new CSVs merged as they are without proper evaluation.
The PR is not just about adding the script to the repo. The expectation is to make it work end-to-end with confidence that the current behavior does not break.
If we need to follow the order strictly, I will spend more time on it.
I do not understand why it would take more development time to sort the final CSV file by its first column. The script already uses pandas DataFrames.
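For illustration, sorting by the first column is indeed a one-liner once the data is in a DataFrame; final_df here is a toy stand-in for the frame the script builds:

```python
import pandas as pd

# Toy frame standing in for the script's final output.
final_df = pd.DataFrame({"Exec": ["SortExec", "FilterExec"],
                         "Supported": ["S", "S"]})

# Sort by whatever the first column happens to be.
final_df = final_df.sort_values(by=final_df.columns[0], ignore_index=True)
```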
The current bogus diff is due to the two new columns DAYTIME and YEARMONTH. If we remove these columns and compare with the old CSV files in the plugin again, we see the following.
The diff file is incorrect. I will show just one case here, but there are possibly other entries. For example, the entry below for "Empty2Null":
@@ -196,2 +186,0 @@
-Empty2Null,S, ,None,project,input,NA,NA,NA,NA,NA,NA,NA,NA,NA,S,NA,NA,NA,NA,NA,NA,NA,NA
-Empty2Null,S, ,None,project,result,NA,NA,NA,NA,NA,NA,NA,NA,NA,S,NA,NA,NA,NA,NA,NA,NA,NA
However, Empty2Null exists in the tools CSV file. Also, the override_supported_config has no entry to set the SQL function empty2null:
spark-rapids-tools/core/src/main/resources/supportedExprs.csv
Lines 736 to 737 in b84d2ef
Empty2Null,S,`empty2null`,None,project,input,NA,NA,NA,NA,NA,NA,NA,NA,PS,NA,NA,NA,NA,NA,NA,NA,NA,NA
Empty2Null,S,`empty2null`,None,project,result,NA,NA,NA,S,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
longer SQL func values (date_add; dateadd vs date_add)
This is not 100% correct. Another reason why the diff is bogus is that the order within the SQL func column has changed on the plugin side. For example, in the tools CSV we have `=` followed by `==`, while on the plugin side we have `==` followed by `=`.
spark-rapids-tools/core/src/main/resources/supportedExprs.csv
Lines 193 to 198 in b84d2ef
EqualTo,S,`=`; `==`,None,project,lhs,S,S,S,S,S,S,S,S,PS,S,S,S,NS,NS,NS,NA,PS,NS
EqualTo,S,`=`; `==`,None,project,rhs,S,S,S,S,S,S,S,S,PS,S,S,S,NS,NS,NS,NA,PS,NS
EqualTo,S,`=`; `==`,None,project,result,S,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
EqualTo,S,`=`; `==`,None,AST,lhs,S,S,S,S,S,NS,NS,S,PS,S,NS,NS,NS,NS,NS,NA,NS,NS
EqualTo,S,`=`; `==`,None,AST,rhs,S,S,S,S,S,NS,NS,S,PS,S,NS,NS,NS,NS,NS,NA,NS,NS
EqualTo,S,`=`; `==`,None,AST,result,S,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
The diff may still look bogus, but if we look closely, we can see the difference is mainly due to:
That's what my initial comment was all about. What we need from this PR is to include a report of what is changing.
Obviously, diff is not the way to do it.
This PR introduces a script written in Python, which is a fully fledged framework. We can use its capabilities to report the actual difference for each record regardless of column order, and of the order within cells.
There is no need to reinvent the wheel, as a Python package such as pandas has all those functionalities.
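As a sketch of that idea, pandas' merge with an indicator can bucket rows into new, removed, and modified entries independent of row order; the function and column names below are illustrative, not part of the PR:

```python
import pandas as pd

def diff_report(old_df, new_df, keys):
    """Bucket rows into new/removed/modified, keyed on the operator columns."""
    merged = old_df.merge(new_df, on=keys, how="outer",
                          indicator=True, suffixes=("_old", "_new"))
    both = merged[merged["_merge"] == "both"]
    # Non-key columns appear twice with _old/_new suffixes after the merge.
    value_cols = [c[:-len("_old")] for c in merged.columns if c.endswith("_old")]
    changed = pd.Series(False, index=both.index)
    for col in value_cols:
        changed |= both[col + "_old"] != both[col + "_new"]
    return {
        "new_entries": merged[merged["_merge"] == "right_only"],
        "removed_entries": merged[merged["_merge"] == "left_only"],
        "modified_entries": both[changed],
    }
```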
Suggestions to move forward
What I expect from this PR is to get a CSV file that I can instantly use to replace the current tools CSV file, with confidence that the Q's behavior is identical.
It does not make much sense to split the logic of the operation between Jenkins and this script.
- Why, then, would we have this Python script in the GitHub repo?
- How do we test when the logic is split across two different places?
- What if we face a blocker getting the automated PR implemented? Then we are stuck with a non-end-to-end sync mechanism. If the logic is in one place in GitHub, we should be able to run the script manually to update the CSV files without waiting for the Jenkins job to run (given all the complexity of how the PR is going to work).
- In my comment, I described additional enums specific to changes coming from the tools side ("TON"/"TOFF").
- The script should take an argument representing the directory of the tools-generated CSV files, to be used as a reference to maintain the new changes.
- The script should generate a report with the following sections:
  - "New entries": entirely new entries. These should not be added automatically to the tools CSV file.
  - "Removed entries": (1) we define ML functions that were not defined on the plugin's side; (2) remember that the plugin side drops support for specific Spark versions, and an entry may somehow disappear from the record. On the tools side, entries stick once they are added to the system, because the tools support event logs back to Spark 2.x.
  - "Modified entries": the changes in support level done to existing entries. These are the entries whose support level has been modified compared to the existing tools file.
- The above generated report is important because a dev can manually audit it to verify that the "merge" is doing its job and that no new changes are introduced without notice.
- Optional: new entries can be added to the CSV file provided that their support level is marked as TOFF, or even TNEW (standing for "newly added entry"). Then, after proper testing, their support level can be restored. Note that this requires the script to append them automatically to the override-config files.
@nartal1 has good knowledge of the operators, and I suggest him as a good source of assistance in navigating this, since you can both meet in person.
NA = 1
NS = 2
CO = 3
PS = 4
S = 5
It is difficult to know why a given operator gets its level of support, especially since we want to override that support on demand. For instance, when Hao wants to troubleshoot an operator, he won't be able to tell whether the value comes from the plugin, was overridden by the tools, or comes from a specific version of Spark.
I suggest we define two other members:
- TOFF, standing for "tools-OFF"
- TON, standing for "tools-ON"
This will make it easy for anyone to know that an operator is turned on/off on the tools side.
- When we want to enforce disabling an operator:
  - Add an entry in the override-config file defining its support level to be "TOFF".
  - Run the script to generate a new CSV file, which will have TOFF for that operator.
- When we want to enforce enabling an operator:
  - Add an entry in the override-config file defining its support level to be "TON".
  - Run the script to generate a new CSV file, which will have TON for that operator.
This change requires that we mirror those two members on the Scala side as well.
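The proposed flow can be sketched minimally as follows; the operator names and the override-config contents are hypothetical examples, not real entries:

```python
# Hypothetical override-config contents: operator name -> tools-side marker.
# TOFF forces an operator off on the tools side; TON forces it on.
TOOLS_OVERRIDES = {"AvroScan": "TOFF", "PromotePrecision": "TON"}

def resolve_level(operator, plugin_level):
    """Prefer a tools-side override; otherwise keep the plugin's level."""
    return TOOLS_OVERRIDES.get(operator, plugin_level)
```

Because the marker itself records the provenance, anyone reading the generated CSV can tell the value was set on the tools side rather than pulled from the plugin.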
Signed-off-by: cindyyuanjiang <[email protected]>
Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>
Thanks @cindyyuanjiang !
This is going to be a very useful script moving forward.
Fixes #846
The spark-rapids repo added per-Apache-Spark-version CSV file generation under spark-rapids/tools/generated_files.
This PR syncs with the plugin and updates the existing CSV files under spark-rapids-tools/core/src/main/resources. The new files are generated using the Python script included in this PR. The high-level approach is to take the union of the CSV files across all Spark versions, which is explained in more detail in the Python file.
Follow up: